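03-040. Data Type Conversion

Data Type Conversion

Data type conversion is a step that almost every data analysis, visualization, or modeling task has to go through. For example, data that looks numeric but is stored as strings cannot be used in arithmetic or statistical analysis until it is converted. When dates are stored as strings, time calculations and sorting become difficult.

The wrong type causes problems such as the following.

Inefficient memory use
Program errors
Misinterpreted results

Converting each column to the type that matches its meaning is therefore a core step of data preprocessing.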
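
As a minimal sketch of the first problem, assume a small made-up Series named `prices` (not part of the chapter's data). While the numbers are stored as strings, operators and comparisons follow string rules rather than numeric ones.

```python
import pandas as pd

# Hypothetical sample: numbers stored as strings
prices = pd.Series(["10", "9", "100"])

# "+" concatenates strings instead of adding numbers
doubled_text = (prices + prices).tolist()
# max() compares character by character, so "9" beats "100"
text_max = prices.max()

# After conversion the same operations are numeric
numeric_prices = pd.to_numeric(prices)
doubled_numeric = (numeric_prices + numeric_prices).tolist()
numeric_max = numeric_prices.max()

print(doubled_text, text_max)
print(doubled_numeric, numeric_max)
```

No error is raised in the string case, which is why this kind of problem can lead to misinterpreted results rather than an obvious failure.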

Data Types

A data type is the property that determines how each value is stored inside the computer and which operations can be applied to it. The main data types used in Python and Pandas are the following.

int / int64: integers (e.g. 1, 100, -5)
float / float64: floating-point numbers (e.g. 3.14, -0.001)
bool: Boolean, true or false (e.g. True, False)
object / string: text (e.g. "apple", "2023-01-01")
datetime64: date and time (e.g. 2023-01-01 12:00:00)
category: categorical (e.g. "A", "B", "C", a small fixed set of values)

ℹ️ Note: The Pandas object type usually means string data, but it can actually hold any Python object.
Since pandas 1.0 there is also a dedicated text type, string.
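
The note above can be checked with a short sketch. The sample values are made up for illustration; the point is only that an `object` column accepts mixed Python objects, while the `string` dtype is reserved for text.

```python
import pandas as pd

# An object column can mix arbitrary Python objects
mixed = pd.Series(["apple", 3, 2.5], dtype=object)
value_types = [type(v).__name__ for v in mixed]

# The dedicated string dtype holds text only
text = pd.Series(["apple", "banana"], dtype="string")

print(mixed.dtype, value_types)
print(text.dtype)
```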

How to Check Data Types

Data types can be checked with Python built-in functions and with Pandas functions.

import pandas as pd
import numpy as np

# Create sample data
data = {
    'name': ['정하늘', '윤서진', '강도현'],
    'age': [25, 30, 35],
    'height': [175.5, 162.3, 180.0],
    'is_student': [True, False, False],
    'join_date': ['2025-01-15', '2025-02-20', '2025-03-10']
}

df = pd.DataFrame(data)

# 1. Check the types of the whole DataFrame (info() prints its report directly)
print("DataFrame info:")
df.info()
print()

# 2. Check the type of each column
print("Data type of each column:")
print(df.dtypes)
print()

# 3. Check the type of a specific column
print(f"Type of the name column: {df['name'].dtype}")
print(f"Type of the age column: {df['age'].dtype}")
print()

# 4. Check the type of individual values with Python's built-in type()
print("Types of individual values:")
print(f"Type of the first name: {type(df['name'].iloc[0])}")
print(f"Type of the first age: {type(df['age'].iloc[0])}")
print()

# 5. Check memory usage
print("Memory usage:")
print(df.memory_usage(deep=True))

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        3 non-null      object 
 1   age         3 non-null      int64  
 2   height      3 non-null      float64
 3   is_student  3 non-null      bool   
 4   join_date   3 non-null      object 
dtypes: bool(1), float64(1), int64(1), object(2)
memory usage: 231.0+ bytes

Data type of each column:
name           object
age             int64
height        float64
is_student       bool
join_date      object
dtype: object

Type of the name column: object
Type of the age column: int64

Types of individual values:
Type of the first name: <class 'str'>
Type of the first age: <class 'numpy.int64'>

Memory usage:
Index         132
name          264
age            24
height         24
is_student      3
join_date     201
dtype: int64

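Why Data Type Problems Occur

Incorrect automatic type inference when reading CSV files
Numeric data that contains missing values or special characters
Inconsistent date formats
Confusion between numeric and categorical data (for example, category codes stored as numbers)
Type choices that ignore memory efficiency

Data type conversion is more than a change of form. The result has to reflect what the data means and how it will be used, so it deserves a careful approach.

ℹ️ Data type vs. data format
A data type is the kind of data as the computer sees it; a data format is the way the data is presented to people. For example, '2025-03-12' is a string type written in a date format.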
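
The distinction can be sketched as follows. The second text format (`20250312`) is an assumption added for illustration: two different formats parse to the same datetime value, and formatting that value again changes only its presentation.

```python
import pandas as pd

# Two different text formats for the same calendar date
iso_text = "2025-03-12"
compact_text = "20250312"

iso_date = pd.to_datetime(iso_text, format="%Y-%m-%d")
compact_date = pd.to_datetime(compact_text, format="%Y%m%d")

# Same type and same value, regardless of the original format
same_value = iso_date == compact_date
# Formatting changes only the presentation, not the type
slash_text = iso_date.strftime("%Y/%m/%d")

print(type(iso_text).__name__, type(iso_date).__name__, same_value, slash_text)
```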

Examples of Data Type Problems

Let's look at a few examples of data with type problems.

Game player data

Player ID	Level	Experience	Play time	Premium
P001	45	125,500	3시간 30분	true
P002	32.0	89500	210분	false
P003	Lv.28	67.8K	02:45:30	1
P004	MAX	N/A	4.5h	0

Stock trading data

Ticker	Trade date	Close	Volume	Change
005930	2025-12-01	75,000원	1,234,567	+2.5%
AAPL	12/01/2025	$195.89	2.5M	-1.2
000660	20251201	85500	987,654주	0.8%
TSLA	2025.12.01	240.83	1500000	flat

Medical checkup data

Patient ID	Blood pressure	Weight	Blood sugar	Checkup date
H001	120/80	70.5kg	정상	2025-11-15
H002	140-90	68	110mg/dL	11/16/2025
H003	정상	72.3	95	20251117
H004	130/85mmHg	75.0	경계	2025년 11월 18일

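Values of the same kind appear in different formats. Play time, for instance, is written as 3시간 30분 (3 hours 30 minutes), 210분 (210 minutes), 02:45:30, and 4.5h. This rarely happens when data comes from a data platform; it shows up in CSV files, manually entered data, and web-scraped data. Such values must be converted to one consistent data type before the analysis can be accurate.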
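
As one possible way to unify such a column, the sketch below parses the four trade-date spellings from the stock table. The list `candidate_formats` is an assumption chosen to match those four values; real data would need its own list.

```python
import pandas as pd

# Trade dates written in four different formats, as in the stock table above
raw_dates = pd.Series(["2025-12-01", "12/01/2025", "20251201", "2025.12.01"])

# Assumed list of candidate formats; extend it to match your own data
candidate_formats = ["%Y-%m-%d", "%m/%d/%Y", "%Y%m%d", "%Y.%m.%d"]

# Try each format in turn and keep the first successful parse per row
parsed = pd.to_datetime(raw_dates, format=candidate_formats[0], errors="coerce")
for fmt in candidate_formats[1:]:
    attempt = pd.to_datetime(raw_dates, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

print(parsed)
```

Listing the formats explicitly avoids guessing whether a value such as 12/01/2025 means month first or day first.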

You can also ask an AI for more examples of such data.

❓AI prompt: Give me examples of data where the data type becomes a problem in data analysis.

Main Data Types

Let's go through the main data types used in Pandas.

Numeric

import pandas as pd
import numpy as np

# Create sample data
data = {
    'int_col': ['1', '2', '3', '4', '5'],
    'float_col': ['1.5', '2.7', '3.14', '4.0', '5.99'],
    'mixed_col': ['100', '200.5', '300', 'N/A', '500'],
    'currency_col': ['₩1,000', '₩2,500', '₩3,200', '₩4,100', '₩5,000']
}

df = pd.DataFrame(data)
print("Original data types:")
print(df.dtypes)
print("\nOriginal data:")
print(df)

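Integer conversion

# Convert to integers; the nullable Int64 dtype can also hold missing values
df['int_converted'] = pd.to_numeric(df['int_col'], errors='coerce').astype('Int64')

print(f"Type before conversion: {df['int_col'].dtype}")
print(f"Type after conversion: {df['int_converted'].dtype}")
print(f"Conversion result:\n{df[['int_col', 'int_converted']]}")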
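
The reason for `astype('Int64')` is easier to see with a value that fails to parse. In this sketch (the Series `raw` is made up for illustration), `errors='coerce'` alone produces a float column, while the nullable `Int64` dtype keeps the integers and marks the gap as missing.

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed
raw = pd.Series(["1", "2", "x", "4"])

# NaN forces the result to float64
as_float = pd.to_numeric(raw, errors="coerce")
# Int64 (capital I) stores integers together with <NA>
as_nullable = as_float.astype("Int64")

print(as_float.dtype, as_float.tolist())
print(as_nullable.dtype, as_nullable.tolist())
```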

Float conversion

# Convert to floating-point numbers
df['float_converted'] = pd.to_numeric(df['float_col'], errors='coerce')

print(f"Type before conversion: {df['float_col'].dtype}")
print(f"Type after conversion: {df['float_converted'].dtype}")
print(f"Conversion result:\n{df[['float_col', 'float_converted']]}")

Handling complex numeric data

# Handle data with a currency symbol and thousands separators
# (the sample column uses the ₩ symbol, so that is the character to strip)
df['currency_cleaned'] = (df['currency_col']
                          .str.replace('₩', '', regex=False)
                          .str.replace(',', '', regex=False))
df['currency_converted'] = pd.to_numeric(df['currency_cleaned'], errors='coerce')

print("Currency data conversion:")
print(df[['currency_col', 'currency_cleaned', 'currency_converted']])

# Handle data that contains missing values
df['mixed_converted'] = pd.to_numeric(df['mixed_col'], errors='coerce')

print("\nConversion of data with missing values:")
print(df[['mixed_col', 'mixed_converted']])
print(f"Number of missing values: {df['mixed_converted'].isna().sum()}")

String (String/Object) Type Conversion

# Text data example
text_data = {
    'name': ['최민준', '한지우', '송예린', '강태현'],
    'code': [1001, 1002, 1003, 1004],
    'category': ['A', 'B', 'A', 'C'],
    'description': ['좋은 제품', '보통 제품', '우수 제품', '나쁜 제품']
}

df_text = pd.DataFrame(text_data)

# Convert numbers to text
df_text['code_str'] = df_text['code'].astype(str)

# Use the explicit string dtype (pandas 1.0+)
df_text['name_string'] = df_text['name'].astype('string')

print("String conversion result:")
print(df_text.dtypes)
print(df_text)

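Datetime Type Conversion

Dates are a type that frequently causes trouble during conversion. Formats, time zones, and daylight saving time (DST) are all common sources of problems.

# Date data in various formats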
date_data = {
    'date1': ['2023-01-01', '2023-02-15', '2023-03-20', '2023-04-10'],
    'date2': ['2023/01/01', '2023/02/15', '2023/03/20', '2023/04/10'],
    'date3': ['20230101', '20230215', '20230320', '20230410'],
    'datetime1': ['2023-01-01 09:00:00', '2023-02-15 14:30:00', 
                  '2023-03-20 18:45:00', '2023-04-10 22:15:00'],
    'timestamp': [1672531200, 1676462400, 1679328000, 1681142400]
}

df_date = pd.DataFrame(date_data)
print("Original date data:")
print(df_date.dtypes)

# Several ways to convert dates
df_date['date1_converted'] = pd.to_datetime(df_date['date1'])
df_date['date2_converted'] = pd.to_datetime(df_date['date2'])
df_date['date3_converted'] = pd.to_datetime(df_date['date3'], format='%Y%m%d')
df_date['datetime1_converted'] = pd.to_datetime(df_date['datetime1'])
df_date['timestamp_converted'] = pd.to_datetime(df_date['timestamp'], unit='s')

print("\nTypes after conversion:")
print(df_date.select_dtypes(include=['datetime64']).dtypes)

# Extract components from a date
df_date['year'] = df_date['date1_converted'].dt.year
df_date['month'] = df_date['date1_converted'].dt.month
df_date['day'] = df_date['date1_converted'].dt.day
df_date['weekday'] = df_date['date1_converted'].dt.day_name()

print("\nExtracted date components:")
print(df_date[['date1_converted', 'year', 'month', 'day', 'weekday']])
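
The time zone and DST issues mentioned at the start of this section can be illustrated with a short sketch. The zone `America/New_York` and the two timestamps are assumptions chosen because DST began there on 2025-03-09; they are not part of the chapter's data.

```python
import pandas as pd

# Naive timestamps on the day DST starts in New York (assumed example zone)
naive = pd.to_datetime(pd.Series(["2025-03-09 01:30:00", "2025-03-09 03:30:00"]))

# Attach the local time zone, then express the same instants in UTC
local = naive.dt.tz_localize("America/New_York")
utc = local.dt.tz_convert("UTC")

# The wall-clock gap and the real elapsed time differ by the skipped hour
naive_gap = naive.iloc[1] - naive.iloc[0]
real_gap = utc.iloc[1] - utc.iloc[0]

print(naive_gap, real_gap)
```

Time differences computed on naive timestamps can therefore be off by an hour around DST changes, which is one reason to localize before doing time arithmetic.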

Boolean Type Conversion

# Boolean data example
bool_data = {
    'text_bool': ['True', 'False', 'True', 'False'],
    'numeric_bool': [1, 0, 1, 0],
    'yn_bool': ['Y', 'N', 'Y', 'N'],
    'yes_no': ['Yes', 'No', 'Yes', 'No']
}

df_bool = pd.DataFrame(bool_data)

# Several Boolean conversions
df_bool['text_bool_converted'] = df_bool['text_bool'].map({'True': True, 'False': False})
df_bool['numeric_bool_converted'] = df_bool['numeric_bool'].astype(bool)
df_bool['yn_bool_converted'] = df_bool['yn_bool'].map({'Y': True, 'N': False})
df_bool['yes_no_converted'] = df_bool['yes_no'].map({'Yes': True, 'No': False})

print("Boolean conversion result:")
print(df_bool)
print("\nTypes after conversion:")
print(df_bool.select_dtypes(include=['bool']).dtypes)
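
The text columns above are converted with `map()` rather than `astype(bool)` for a reason that a small sketch makes visible. The Series `flags` is a made-up example: Python treats every non-empty string as true, so `astype(bool)` turns the text 'False' into True.

```python
import pandas as pd

# Hypothetical text flags stored as Python strings
flags = pd.Series(["True", "False", "False"], dtype=object)

# Every non-empty string is truthy, so this is True for all rows
naive = flags.astype(bool).tolist()

# An explicit mapping gives the intended values
mapped = flags.map({"True": True, "False": False}).tolist()

print(naive)
print(mapped)
```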

Categorical Type Conversion

# Categorical data example
cat_data = {
    'grade': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'size': ['Small', 'Medium', 'Large', 'Small', 'Medium', 'Large', 'Small', 'Medium'],
    'satisfaction': ['매우불만', '불만', '보통', '만족', '매우만족', '만족', '보통', '불만']
}

df_cat = pd.DataFrame(cat_data)

# Plain categorical conversion
df_cat['grade_cat'] = df_cat['grade'].astype('category')

# Ordered categorical conversion
df_cat['size_ordered'] = pd.Categorical(df_cat['size'], 
                                       categories=['Small', 'Medium', 'Large'], 
                                       ordered=True)

df_cat['satisfaction_ordered'] = pd.Categorical(df_cat['satisfaction'],
                                               categories=['매우불만', '불만', '보통', '만족', '매우만족'],
                                               ordered=True)

print("Categorical conversion result:")
print(df_cat.dtypes)
print("\nMemory usage comparison:")
print(f"String: {df_cat['grade'].memory_usage(deep=True)} bytes")
print(f"Category: {df_cat['grade_cat'].memory_usage(deep=True)} bytes")

# Information about the categorical data
print(f"\nCategories: {df_cat['size_ordered'].cat.categories}")
print(f"Ordered: {df_cat['size_ordered'].cat.ordered}")
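
What the `ordered=True` flag buys can be sketched with a small standalone Series (the values below are illustrative). Sorting and comparisons follow the declared category order instead of alphabetical order.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ["Small", "Large", "Medium"],
    categories=["Small", "Medium", "Large"],
    ordered=True,
))

# Sorting follows the category order, not the alphabet
sorted_sizes = sizes.sort_values().tolist()
# Comparisons against a category are allowed for ordered categoricals
at_least_medium = (sizes >= "Medium").tolist()
# Internally each value is stored as a small integer code
codes = sizes.cat.codes.tolist()

print(sorted_sizes, at_least_medium, codes)
```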

How to Convert Data Types

Basic conversion methods

# Basic conversion methods
sample_df = pd.DataFrame({
    'A': ['1', '2', '3', '4'],
    'B': ['1.5', '2.5', '3.5', '4.5'],
    'C': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'D': ['True', 'False', 'True', 'False']
})

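print("Original data types:")
print(sample_df.dtypes)

# Using astype()
sample_df['A_int'] = sample_df['A'].astype(int)
sample_df['B_float'] = sample_df['B'].astype(float)
# astype(bool) would turn every non-empty string, including 'False', into True,
# so the text values are mapped explicitly instead
sample_df['D_bool'] = sample_df['D'].map({'True': True, 'False': False})

print("\nAfter conversion:")
print(sample_df.dtypes)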

Safe conversion

# Safe conversion with error handling
def safe_convert_numeric(series, target_type='float'):
    """Convert to numbers; values that cannot be parsed become missing."""
    try:
        if target_type == 'int':
            return pd.to_numeric(series, errors='coerce').astype('Int64')
        else:
            return pd.to_numeric(series, errors='coerce')
    except Exception as e:
        print(f"Conversion error: {e}")
        return series

def safe_convert_datetime(series, format=None):
    """Convert to datetimes; values that cannot be parsed become NaT."""
    try:
        if format:
            return pd.to_datetime(series, format=format, errors='coerce')
        else:
            return pd.to_datetime(series, errors='coerce')
    except Exception as e:
        print(f"Date conversion error: {e}")
        return series

# Test with data that contains errors
error_data = pd.DataFrame({
    'numbers': ['1', '2', 'abc', '4', '5.5'],
    'dates': ['2023-01-01', '2023-02-30', '2023-03-15', 'invalid', '2023-05-01']
})

print("Data with errors:")
print(error_data)

# Apply the safe conversions
error_data['numbers_safe'] = safe_convert_numeric(error_data['numbers'])
error_data['dates_safe'] = safe_convert_datetime(error_data['dates'])

print("\nAfter safe conversion:")
print(error_data)
print(f"\nMissing values after numeric conversion: {error_data['numbers_safe'].isna().sum()}")
print(f"Missing values after date conversion: {error_data['dates_safe'].isna().sum()}")
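
Counting the missing values shows how many conversions failed, but not which inputs caused them. A minimal sketch for listing the offending raw values, using a standalone Series with the same contents as the `numbers` column above:

```python
import pandas as pd

raw = pd.Series(["1", "2", "abc", "4", "5.5"])
converted = pd.to_numeric(raw, errors="coerce")

# Rows that were present in the input but became missing after conversion
failed = raw[converted.isna() & raw.notna()]

print(failed.tolist())
```

Inspecting these values before dropping or filling them helps decide whether they are typos, placeholders such as N/A, or a sign that the column is not numeric at all.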

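Practical Examples

Specifying types when reading a CSV file

# Example of specifying types when reading a CSV file
# No real file is available, so sample data is written to disk first

# Create and save the sample data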
sample_data = {
    'user_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['김철수', '이영희', '박민수', '정수진', '최지우'],
    'age': [25, 32, 28, 35, 29],
    'salary': ['3000000', '4500000', '3800000', '5200000', '4100000'],
    'join_date': ['2020-01-15', '2019-03-20', '2021-07-10', '2018-11-05', '2022-02-28'],
    'is_active': ['Y', 'N', 'Y', 'Y', 'N'],
    'department': ['개발', '마케팅', '개발', '인사', '마케팅']
}

sample_df = pd.DataFrame(sample_data)
sample_df.to_csv('sample_employee.csv', index=False)

# Default read: pandas infers the types (digit-only columns become int64, the rest object)
df_default = pd.read_csv('sample_employee.csv')
print("Default read result:")
print(df_default.dtypes)

# Read with explicit types
dtype_dict = {
    'user_id': 'int64',
    'name': 'string',
    'age': 'int64',
    'salary': 'string',  # converted later
    'is_active': 'string',
    'department': 'category'
}

df_typed = pd.read_csv('sample_employee.csv', 
                       dtype=dtype_dict,
                       parse_dates=['join_date'])

print("\nResult of reading with explicit types:")
print(df_typed.dtypes)

# Further conversion
df_typed['salary'] = pd.to_numeric(df_typed['salary'])
df_typed['is_active'] = df_typed['is_active'].map({'Y': True, 'N': False})

print("\nFinal conversion result:")
print(df_typed.dtypes)
print(df_typed)
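
One common case where the `dtype` argument matters is an identifier with leading zeros, such as the stock codes shown earlier. The sketch below reads a small in-memory CSV (the text and column names are made up for illustration) with and without an explicit type.

```python
import io
import pandas as pd

# Hypothetical CSV content with zero-padded codes
csv_text = "code,close\n005930,75000\n000660,85500\n"

# Type inference reads the codes as integers and drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))
# Declaring the column as text keeps the codes intact
typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": "string"})

print(inferred["code"].tolist())
print(typed["code"].tolist())
```

Once the zeros are dropped at read time they cannot be recovered reliably afterwards, so the type has to be fixed when the file is loaded.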

Memory optimization

# Memory optimization example
def optimize_memory(df):
    """Downcast numeric columns and convert low-cardinality text columns to category."""
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory usage before optimization: {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                # Pick the smallest integer type whose range contains the column (bounds are inclusive)
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)

            elif str(col_type)[:5] == 'float':
                # float32 keeps only about 7 significant digits, so this trades precision for memory
                if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)

        # Convert text columns to category when they have few unique values
        elif col_type == 'object':
            num_unique_values = df[col].nunique()
            num_total_values = len(df[col])
            if num_unique_values / num_total_values < 0.5:
                df[col] = df[col].astype('category')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory usage after optimization: {end_mem:.2f} MB")
    print(f"Memory saved: {100 * (start_mem - end_mem) / start_mem:.1f}%")

    return df

# Simulate a large dataset
large_data = {
    'id': range(10000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'value': np.random.randint(0, 1000, 10000),
    'score': np.random.random(10000) * 100,
    'flag': np.random.choice([True, False], 10000)
}

large_df = pd.DataFrame(large_data)
print("Data types before optimization:")
print(large_df.dtypes)

# Apply memory optimization
large_df_optimized = optimize_memory(large_df.copy())
print("\nData types after optimization:")
print(large_df_optimized.dtypes)
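
As an alternative to choosing the target type by hand, `pd.to_numeric` has a `downcast` parameter that selects the smallest numeric type able to hold the values. A minimal sketch with made-up columns:

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 500, 999], dtype="int64")
scores = pd.Series([0.5, 42.25, 99.75], dtype="float64")

# Smallest signed integer type that fits the values
small_int = pd.to_numeric(values, downcast="integer")
# Smallest unsigned integer type (only suitable for non-negative data)
small_uint = pd.to_numeric(values, downcast="unsigned")
# Floats are downcast to float32 at most
small_float = pd.to_numeric(scores, downcast="float")

print(small_int.dtype, small_uint.dtype, small_float.dtype)
```

This handles one column at a time; a loop such as the one in `optimize_memory` is still needed to apply it across a whole DataFrame.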

Processing mixed-format data

# A more realistic example with complex data
complex_data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'price': ['$1,299.99', '$899.50', '$2,199.00', '$599.99', '$1,799.95'],
    'rating': ['4.5/5', '3.8/5', '4.9/5', '4.2/5', '4.7/5'],
    'launch_date': ['2025-01-15', '2024-11-20', '2025-03-10', '2024-08-05', '2026-02-28'],
    'availability': ['In Stock', 'Out of Stock', 'In Stock', 'Limited', 'In Stock'],
    'dimensions': ['15.6" x 10.2" x 0.8"', '13.3" x 9.1" x 0.7"', 
                   '17.3" x 11.5" x 0.9"', '14.0" x 9.8" x 0.8"', '16.0" x 10.8" x 0.8"']
}

complex_df = pd.DataFrame(complex_data)
print("Original complex data:")
print(complex_df)

# Price: strip the currency symbol and thousands separators as literal characters
complex_df['price_numeric'] = (complex_df['price']
                              .str.replace('$', '', regex=False)
                              .str.replace(',', '', regex=False)
                              .astype(float))

# Rating: keep the part before the slash
complex_df['rating_numeric'] = (complex_df['rating']
                               .str.split('/')
                               .str[0]
                               .astype(float))

# Launch date
complex_df['launch_date_dt'] = pd.to_datetime(complex_df['launch_date'])

# Availability as an ordered categorical
availability_order = ['Out of Stock', 'Limited', 'In Stock']
complex_df['availability_cat'] = pd.Categorical(complex_df['availability'],
                                               categories=availability_order,
                                               ordered=True)

# Screen size: the first number followed by an inch mark in the dimensions
# (expand=False returns a Series instead of a one-column DataFrame)
complex_df['screen_size'] = (complex_df['dimensions']
                            .str.extract(r'(\d+\.?\d*)"', expand=False)
                            .astype(float))

print("\nResult after conversion:")
print(complex_df[['price_numeric', 'rating_numeric', 'launch_date_dt', 
                  'availability_cat', 'screen_size']])
print("\nFinal data types:")
print(complex_df.dtypes)

Cautions When Converting Data Types

Preventing data loss

# Data loss example
loss_data = pd.DataFrame({
    'decimal_numbers': [1.7, 2.9, 3.1, 4.8, 5.2],
    'large_numbers': [1234567890123, 9876543210987, 5555555555555, 1111111111111, 9999999999999]
})

print("Original data:")
print(loss_data)

# Loss of the fractional part (astype(int) truncates, it does not round)
loss_data['decimal_to_int'] = loss_data['decimal_numbers'].astype(int)
print("\nLoss of the fractional part:")
print(loss_data[['decimal_numbers', 'decimal_to_int']])

# Loss of precision
loss_data['large_to_float32'] = loss_data['large_numbers'].astype(np.float32)
print("\nLoss of precision:")
print(loss_data[['large_numbers', 'large_to_float32']])

# Check whether a conversion is safe
def check_conversion_safety(original, converted):
    """Warn when a conversion may lose information."""
    if original.dtype.kind in 'fc' and converted.dtype.kind in 'iu':
        # Float to integer: the fractional part is dropped
        has_decimals = (original != original.astype(int)).any()
        if has_decimals:
            print("⚠️ Warning: fractional values will be lost!")

    if converted.dtype.kind in 'fc' and converted.dtype.itemsize < original.dtype.itemsize:
        # Conversion to a narrower floating-point type
        print("⚠️ Warning: precision may be lost!")

check_conversion_safety(loss_data['decimal_numbers'], loss_data['decimal_to_int'])
check_conversion_safety(loss_data['large_numbers'], loss_data['large_to_float32'])
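
Another way to detect loss is a round-trip check: convert, convert back, and compare with the original. The helper `is_lossless` below is a hypothetical name introduced for this sketch, and it assumes the column has no missing values (NaN never compares equal to itself).

```python
import numpy as np
import pandas as pd

def is_lossless(original, target_dtype):
    """Return True if converting and converting back reproduces every value."""
    converted = original.astype(target_dtype)
    restored = converted.astype(original.dtype)
    return bool((restored == original).all())

decimals = pd.Series([1.7, 2.9, 3.1])
whole = pd.Series([1.0, 2.0, 3.0])
large = pd.Series([1234567890123, 9876543210987])

decimals_ok = is_lossless(decimals, "int64")   # fractions are truncated
whole_ok = is_lossless(whole, "int64")         # whole numbers survive
large_ok = is_lossless(large, np.float32)      # float32 cannot hold 13 digits

print(decimals_ok, whole_ok, large_ok)
```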

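Performance considerations when converting data types

Converting types, especially parsing strings into numbers or dates, typically takes longer than ordinary arithmetic on columns that are already numeric, because every value has to be parsed. With large datasets the processing time can therefore grow considerably.

# Performance comparison example
import time

# Create a large dataset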
n = 1000000
perf_data = pd.DataFrame({
    'string_numbers': [str(i) for i in range(n)],
    'categories': np.random.choice(['A', 'B', 'C', 'D', 'E'], n)
})

# Compare numeric conversion performance
print("Numeric conversion performance:")

# astype()
start_time = time.perf_counter()
result1 = perf_data['string_numbers'].astype(int)
time1 = time.perf_counter() - start_time
print(f"astype(): {time1:.4f} s")

# pd.to_numeric()
start_time = time.perf_counter()
result2 = pd.to_numeric(perf_data['string_numbers'])
time2 = time.perf_counter() - start_time
print(f"to_numeric(): {time2:.4f} s")

# Compare categorical conversion performance
print("\nCategorical conversion performance:")

# Memory usage of the plain string column
mem_before = perf_data['categories'].memory_usage(deep=True)

# Convert to category
start_time = time.perf_counter()
perf_data['categories_cat'] = perf_data['categories'].astype('category')
cat_time = time.perf_counter() - start_time
mem_after = perf_data['categories_cat'].memory_usage(deep=True)

print(f"Categorical conversion time: {cat_time:.4f} s")
print(f"Memory saved: {(mem_before - mem_after) / mem_before * 100:.1f}%")

Maintaining consistency

# Function that checks data consistency
def check_data_consistency(df):
    """Check the consistency of data types."""
    print("=== Data type consistency check ===")

    for col in df.columns:
        dtype = df[col].dtype
        print(f"\nColumn: {col}")
        print(f"Type: {dtype}")

        if dtype == 'object':
            # For text columns, check how many values could be numbers
            numeric_convertible = pd.to_numeric(df[col], errors='coerce').notna().sum()
            total_count = len(df[col].dropna())

            if numeric_convertible > 0:
                print(f"  ⚠️ Values convertible to numbers: {numeric_convertible}/{total_count}")

            # Check how many values could be dates
            try:
                date_convertible = pd.to_datetime(df[col], errors='coerce').notna().sum()
                if date_convertible > total_count * 0.5:  # more than 50% parse as dates
                    print(f"  ⚠️ Values convertible to dates: {date_convertible}/{total_count}")
            except (ValueError, TypeError):
                pass

        elif dtype.kind in 'iufc':  # numeric, including integers
            print(f"  Range: {df[col].min()} ~ {df[col].max()}")

        elif dtype.name == 'category':
            print(f"  Number of categories: {len(df[col].cat.categories)}")
            print(f"  Categories: {list(df[col].cat.categories)}")

# Consistency check example
inconsistent_data = pd.DataFrame({
    'mixed_col': ['1', '2', 'three', '4', '5'],
    'date_like': ['2023-01-01', '2023-02-15', 'not_date', '2023-04-10', '2023-05-20'],
    'numeric_col': [1, 2, 3, 4, 5],
    'category_col': ['A', 'B', 'A', 'C', 'B']
})

check_data_consistency(inconsistent_data)

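Data Type Conversion Guidelines

Conversion involves many details and small mistakes are easy to make, so it helps to follow a guideline.

Conversion order

Explore the data: understand the content and shape of each column
Plan the types: decide on the appropriate type for the purpose of the analysis
Convert safely: perform the conversion with error handling
Validate: check the conversion result and the integrity of the data
Optimize: tune memory usage and performance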
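
The steps above can be sketched as a small helper. The function name `convert_with_report`, the `plan` dictionary, and the sample DataFrame are all hypothetical and only illustrate one way to combine planning, safe conversion, and validation; the original data is left untouched.

```python
import pandas as pd

def convert_with_report(df, plan):
    """Apply a {column: kind} plan and count values that failed to convert."""
    out = df.copy()  # keep the original data unchanged
    report = {}
    for col, kind in plan.items():
        missing_before = out[col].isna().sum()
        if kind == "numeric":
            out[col] = pd.to_numeric(out[col], errors="coerce")
        elif kind == "datetime":
            out[col] = pd.to_datetime(out[col], errors="coerce")
        elif kind == "category":
            out[col] = out[col].astype("category")
        else:
            raise ValueError(f"unknown kind: {kind}")
        # New missing values are the ones the conversion could not handle
        report[col] = int(out[col].isna().sum() - missing_before)
    return out, report

raw = pd.DataFrame({
    "amount": ["10", "20", "oops"],
    "joined": ["2025-01-15", "2025-02-20", "2025-03-10"],
    "team": ["dev", "ops", "dev"],
})
plan = {"amount": "numeric", "joined": "datetime", "team": "category"}

clean, report = convert_with_report(raw, plan)
print(report)
print(clean.dtypes)
```

A non-zero count in the report points at the columns that need a closer look before the converted data is used.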

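Choosing a conversion method by situation

Situation	Recommended method	Reason
Data known to be clean numbers	`astype()`	Fast and simple
Data that may contain errors	`pd.to_numeric(errors='coerce')`	Safe conversion
Various date formats	`pd.to_datetime()`	Automatic format detection
Memory optimization needed	Categorical conversion	Saves memory
Categories where order matters	`pd.Categorical(ordered=True)`	Preserves order information

It is enough to know that these options exist. Since the code is written with the help of AI, there is no need to memorize them.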

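Wrap-up

Data type conversion is one of the most basic foundations of data analysis. Only with the correct types is accurate analysis possible, and memory efficiency improves as well.

The key to data type conversion is **"choosing the type that accurately reflects the meaning of the data"**. It is more than a change of form; how the data will be used has to be taken into account.

💡 Practical tip
Always keep the original data when converting types, and record the conversion process step by step. With large datasets in particular, choosing types with memory efficiency in mind is important.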
